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Abstract 

We consider the problem of calculating learning curves (i.e., average generalization per- 
formance) of Gaussian processes used for regression. On the basis of a simple expression for 
the generalization error, in terms of the eigenvalue decomposition of the covariance function, 
we derive a number of approximation schemes. We identify where these become exact, and 
compare with existing bounds on learning curves; the new approximations, which can be used 
for any input space dimension, generally get substantially closer to the truth. We also study 
possible improvements to our approximations. Finally, we use a simple exactly solvable learn- 
ing scenario to show that there are limits of principle on the quality of approximations and 
bounds expressible solely in terms of the eigenvalue spectrum of the covariance function. 



1 Introduction: Gaussian processes 

Within the neural networks community, there has in the last few years been a good deal of excitement 
about the use of Gaussian processes as an alternative to feedforward networks (see e.g. (Williams and 
Rasmusscn, 1996|; IWilUams, 1997| [Barber and WiUiams, 199^; poldberg et al., 1998| [Sollich, 1999a]; 



Malzahn and Opper, 2001). The advantages of Gaussian processes are that prior assumptions about 
the problem to be learned are encoded in a very transparent way, and that inference — at least in the 
case of regression that we will consider — is relatively straightforward; they are also 'non-parametric' 
in the sense that their effective number of parameters ('degrees of freedom') can grow arbitrarily 
large as more and more training data is collected. 

One crucial question for applications is then how 'fast' Gaussian processes learn, i.e., how many 
training examples are needed to achieve a certain level of generalization performance. The typical 
(as opposed to worst case) behaviour is captured in the learning curve, which gives the average gen- 
eralization error e as a function of the number of training examples n. Several workers have derived 
bounds on e(n) ([N/bchcUi and Wahba, 198l[; [Plaskota, 19"90| ; [Opper, 1997[ ; [Trecate et al., 1999|; [Oppei 
and VivarcUi, 1999; Williams and Vivarelli, 2000) or studied its large n asymptotics ( ^ilvermaii 



198£; Rittcr, 1996). As we will illustrate below, however, the existing bounds are often far from 



tight; and asymptotic results will not necessarily apply for realistic sample sizes n. Our main aim 
in this paper is therefore to derive approximations to e(n) which get closer to the true learning 
curves than existing bounds, and apply both for small and large n. We compare these approxi- 
mations with existing bounds and the results of numerical simulations; possible improvements to 
the approximations are also discussed. Finally, we study an analytically solvable example scenario 
which sheds light on how tight bounds on learning curves can be made in principle. Summaries of 
the early stages of this work have appeared in the conference proceedings ( ^oUich, 1999a , b|). 

In its simplest form, the regression problem that we are considering is this: We are trying to learn 
a function 0* which maps inputs x (real- valued vectors) to (real- valued scalar) outputs 9*{x). We are 
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given a set of training data D, consisting oin input-output pairs {xi, yi); the training outputs yi may 
differ from the 'clean' target outputs 0*(xi) due to corruption by noise. Given a test input x, we are 
then asked to come up with a prediction 6{x) for the corresponding output, expressed either in the 
simple form of a mean prediction 0{x) plus error bars, or more comprehensively in terms of a 'predic- 
tive distribution' P{9{x)\x, D). In a Bayesian setting, we do this by specifying a prior P{0) over our 
hypothesis functions, and a likelihood P{D\9) with which each 6 could have generated the training 
data; from this we deduce the posterior distribution P{9\D) cx P{D\6)P{6). If we wanted to use a 
feedforward network for this task, we could proceed as follows: Specify candidate networks by a set 
of weights w, with prior probability P{w). Each network defines a (stochastic) input-output relation 
described by the distribution of output y given input x (and weights w), P{y\x, w). Multiplying over 
the whole data set, we get the probability of the observed data having been produced by the network 
with weights w: P{D\w) = YVi=i ^(uiI^I-t'^)- Bayes' theorem then gives us the posterior, i.e., the 
probability of network w given the data, as P{w\D) cx P{D\w)P{'w) up to an overall normalization 
factor. From this, finally, we get the predictive distribution P{y\x,D) = J dwP{y\x,w)P{w\D). 
This solves the regression problem in principle, but leaves us with a nasty integral over all possible 
network weights: the posterior P{w\D) generally has a highly nontrivial structure, with many local 
peaks (corresponding to local minima in the training error). One therefore has to use sophisticated 



Monte Carlo integration techniques (Neal, 1993) or local approximations to P{w\D) around its 



maxima (MacKay, 1992) to tackle this problem. Even once this has been done, one is still left with 
the question of how to interpret the results: We may for example want to select priors on the basis of 
the data, e.g. by making the prior P{w\h) dependent on a set of hyperparameters h and choosing h 
such as to maximize the probability P{D\h) of the data. Once we have found the 'optimal' prior we 
would then hope that it tells us something about the regression problem at hand (whether certain 
input components are irrelevant, for example). This would be easy if the prior told us directly how 
likely certain input-output functions are; instead we have to extract this information from the prior 
over weights, often a complicated process. 

By contrast, for a Gaussian process it is an almost trivial task to obtain the posterior and the 
predictive distribution (see below). One reason for this is that the prior P{0) is defined directly 
over input-output functions 9. How is this done? Any 9 is uniquely determined by its output 
values 9{x) for all x from the input domain, and for a Gaussian process, these are simply assumed 
to have a joint Gaussian distribution (hence the name). This distribution can be specified by the 
mean values {9{x))g (which we assume to be zero in the following, as is commonly done), and 
the covariances {9{x)9{x'))g = C{x,x'); C{x,x') is called the covariance function of the Gaussian 
process. It encodes in an easily interpretable way prior assumptions about the function to be 
learned. Smoothness, for example, is controlled by the behaviour of C{x,x') for x' — > x: The 
Ornstein-Uhlenbeck (OU) covariance function C{x,x') cx exp(— ||a; — a;'||/0 produces very rough 
(non-differentiable) functions, while functions sampled from the radial basis function (RBF) prior 
with C{x,x') cx exp[— ||x — x'|p/(2P)] are infinitely differentiable. (Intermediate priors yielding r 
times differentiable functions can also be defined by using modified Bessel functions as covariance 



functions; see (Williams and Vivarelli, 2000).) Figure [l] illustrates these characteristics with two 



samples from the OU and RBF priors, respectively, over a two-dimensional input domain. The 
'length scale' parameter / in the covariance functions also has an intuitive meaning: It corresponds 
directly to the distance in input space over which we expect our function to vary significantly. More 
complex properties can also be encoded; by replacing I with different length scales for each input 
component, for example, relevant (small I) and irrelevant (large I) inputs can be distinguished. 
How does inference with Gaussian processes w ork? We only g ive a brief summary here and 



refer to existing reviews on the subject (see e.g. ( Williams, 1998 )) for details. It is simplest to 
assume that outputs y are generated from the 'clean' values of a hypothesis function 9{x) by adding 
Gaussian noise of a;- independent variance a^. The joint distribution of a set of n training outputs 
{yi} and the function values 9{x) is then also Gaussian, with covariances given by 



{yWrn) = C{xi,X,n) + Cr Sim ^ (K) 



Im 



{yi9{x)) = C{xi,x) = (k(x))i 
where we have defined an n x n matrix K and an z-dependent n-component vector \^{x). The 
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Figure 1: Samples drawn from Gaussian process priors over functions on [0, 1]^. Left: OU covariance 
function, C{x,x') = exp(— — Right: RBF covariance function, C{x,x') = exp[— — 

a;' IP/ (2Z^)]. The length scale ^ = 0.1 determines in both cases over what distance the functions vary 
significantly. Note the difference in roughness of the two functions; this is related to the behaviour 
of the covariance functions for x x' . 



posterior distribution P(6\D) is then obtained by simply conditioning on the {yi}. It is again 
Gaussian and has mean 

(1) 



{x,D)^{e{x)),^^^k{x) 



and variance 



e{x,D) = ({e{x)-9{x)y' 



no 



= C(x,x) -k(x)^K-ik(.T) 



(2) 



Eqs. (0,^) solve the inference problem for Gaussian process: They provide us directly with the 
predictive distribution P{9{x)\x, D). The posterior variance, eq. (||), in fact also gives us the 
expected generalization error (or Bayes error) at x. Why? If the teacher is 9* , the squared deviation 
between our mean prediction and the teacher outputnis {9{x) — 9*{x))'^; averaging this over the 
posterior distribution of teachers P{9*\D) just gives (g). The underlying assumption is that our 
assumed Gaussian process prior is the true one from which teachers are actually generated (and 
that we are using the correct noise model). Otherwise, the expected generalization error is larger 
and given by a more complicated expression (Williams and Vivarelli, 2000). In line with most 
other work on the subject, we only consider the 'correct prior' case in the following. Averaging the 
generalization error at x over the distribution of inputs gives then 



eiD) = (e(x, D))^ - {Cix, x) - k(a;)TK-ik(x))^ 



(3) 



This form of the generalization error, which is well known (Michelli and Wahba, 1981; Opper, 1997; 



Williams, 1998; Williams and Vivarelli, 2000), still depends on the training inputs; the fact that 



the training outputs have dropped out already is a signature of the fact that Gaussian processes are 
linear predictors (compare (|^)). Averaging over data sets yields the quantity we are after, 



(4) 



This average expected generalization error (we will drop the 'average expected' in the following) 
only depends on the number of training examples n; the function e(n) is called the learning curve. 



^One can also measure the generalization by t 
teacher output; this simply adds a term to eq. (I 



squared deviation between the prediction 0{x) and the noisy 
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n = 2 n = 100 




X X 



Figure 2: Generalization error e(x,D) as a function of input position x G [0,1], for noise level 
= 0.05, RBF covariance function C{x,x') — exp[—\x — x'\'^/{2P)] with I = 0.1, for randomly 
drawn training sets D of size n = 2 (left) and n = 100 (right). To emphasize the difference in scale, 
the plot on the left actually also includes the results for n = 100, just visible below the dashed line 
at e{x, D) = cr^. 



Its exact calculation is difficult because of the joint average in eqs. (|[Q) over the training inputs Xi 
and the test input x. 

Before proceeding with our calculation of the learning curve e(n), let us try to gain some intuitive 
insight into its dependence on n. Consider a simple example scenario, where inputs x are one- 
dimensional and drawn randomly from the unit interval [0,1], with uniform probability. For the 
covariance function we choose an RBF form, C{x, x') = exp[— [x — a;'|^/ (2Z^)] with I — 0.1. Here we 
have taken the prior variance C{x,x) as unity; as seems realistic for most applications we assume 
the noise level to be much smaller than this, cr^ — 0.05. Figure ^ illustrates the x-dependence of 
the generalization error e(x,D) for a small training set (n = 2): Each of the examples has made 
a 'dent' in e{x,D), with a shape that is similar to that of the covariance function^. Outside the 
dents, e(x, D) still has essentially its prior value, e{x, D) = 1; at the centre of each dent it is reduced 
to a much smaller value, £{x,D) « o'^/(l + cr^) (this approximation holds as long as the different 
training inputs are sufficiently far away from each other). The generalization error e{D) is therefore 
dominated by regions where no training examples have been seen; one has e(-D) ^ cr^, and the 
precise value of e{D) depends only very little on (assuming always that cr^ <C 1). Gradually, as 
n is increased, the covariance dents will cover the input space, so that e(x, D) and e(-D) become 
of order this situation is shown on the right of Figure |^. From this point onwards, further 
training examples essetially have the effect of averaging out noise, eventually making e{D) ^ for 
large enough n. In summary, we expect the learning curve e(n) to have two regimes: In the initial 
(small n) regime, where e{n) ^ cr^, e(n) is essentially independent of cr^ and reflects mainly the 
geometrical distribution of covariance dents across the inputs space. In the asymptotic regime {n 
large enough such that e(n) ^ ct^), on the other hand, the noise level ct^ is important in controlling 
the size of t{n) because learning arises mainly from the averaging out of noise in the training data. 

^More precisely, the dents have the shape of the square of the covariance function: If the training inputs Xi are 
sufficiently far apart, then around each Xi we can neglect the influence of the other data points and apply (^) with 
n = 1, giving t{x, D) fa C(x, x) — C'^ {x, Xi) /[C{xi, Xi) + a^]. 
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2 Approximate learning curves 



Calculating learning curves for Gaussian processes exactly is a difficult problem because of the joint 
average in (^,^ over the training inputs xi and the test input x. Several workers have therefore 
derived upper and lower bounds on e ( Michelli and Wahba, 198l] ; Plaskota, 1990; Opper, 1997 



Williams and Vivarelli, 2000| ) or studied the large n asymptotics of e(n) (Silverman, 1985; Ritter 



1996|). As we will illustrate below, however, the existing bounds are often far from tight; likewise, 
asymptotic results can only capture the large n regime defined above and will not necessarily apply 
for sample sizes n occurring in practice. We therefore now attempt to derive approximations to 
e{n) which get closer to the true learning curves than existing bounds, and which are applicable 
both for small and large n. 

As a starting point for an approximate calculation of e(ri), we first derive a representation of the 
generalization error in terms of the eigenvalue decomposition of the covariance function. Mercer's 
theorem (see e.g. (Wong, 1971)) tells us that the covariance function can be decomposed into its 
eigenvalues and eigenfunctions (f>i{x): 



C{x,x') = ^X^(j>i{x)(j)i{x') 



(5) 



i=l 



This is simply the analogue of the eigenvalue decomposition of a finite symmetric matrix. We 
assume here that eigenvalues and eigenfunctions are defined relative to the distribution over inputs 

{C{x,x')M^'))^, = KM^) (6) 

The eigenfunctions are then orthogonal with respect to the same distribution, {(j)i{x)(j)j{x)) ^ = Si 
(see e.g. (Williams and Seeger, 2000)). Now write the data-dependent generalization error i 

eiD) = {C{x, x))^ - tr (k(a.)k(x)T)^ K ' 

and perform the x-average: 

((k(a;)k(x)'^);™)^ = '^\i\j4ii{xi) {(j)i{x)(j)j{x)) ^(f)j{Xm) = Xf(j>i{xi)(j)i{Xm)- 



This suggests introducing the diagonal matrix (A)jj — XiSij and the 'design matrix' {'^)u = (f)i{xi), 
so that (k(a:)k(x)'^)_^ = ^A^*'^. One then also has 

{C{x,x))^^trA (7) 

and the matrix K is expressed as 

K = cr^I + *A*^ 

with I being the identity matrix. Collecting these results, we have 

e{D) =trA-tr(cr2l+^A*'^)"i*A2*'^ 

This can be simplified using the Woodbury formula]^ to give^ e(n) = {e{D))jj with 

e{D) = tr A(I + cr-^^*^*)-! = tr (A^^ + (j-^^'^^)-! (g) 

The advantage of this (still exact) representation of the generalization error is that the average over 
the test input x has already been carried out, and that the remaining dependence on the training 
data is contained entirely in the matrix SJ'"'"^. It also includes as a special case the well-known 



^( A +TTV '^^ i=A ^— A ^U(I+V'^A ^U) ^V'^A ^ for matrices A, U, V of appropriate size; see e.g. (Press 



et al., 19921 ). 

llf the covariance function has zero eigenvalues, the inverse A~^ does not exist, and the first form of t{D) given 
in (b|) must be used; similar alternative forms, though not explicitly written, exist for all following results. 
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result for linear regression (see e.g. (|Sollich, 199^ )); A^^ and can be interpreted as suitably 



generalized versions of the weight decay (matrix) and input correlation matrix. 

Starting from (^), one can now derive approximate expressions for the learning curve e(n). The 
most naive approach is to neglect entirely the fluctuations in '4'"'"^' over different data sets and 
replace it by its average, which is simply — X^/Li {'t'i{^i)4>j{xi)) d ~ '"^^ij- This leads 

to 

eov(n) =tr(A-i + cr-2„l)-i. (9) 

While this is not, in general, a good approximation, it was shown by Opper and Fivarelli to be 
a lower hound (called OV bound below) on the learning curve ( Opper and Vivarelli, 1999| ). It 



becomes tight in the large noise limit cr^ ^ 00 at constant nja^: The fluctuations of the elements 
of the matrix a~^'^"^'^ then become vanishingly small (of order ^Jna~'^ = (nja^^j^ph 0) and so 
replacing by its average is justified. 

To derive better approximations, it is useful to see how the matrix Q = (A~^ + tT^^'S''^'4')~^ 
changes when a new example is added to the training set. This change can be expressed as 



gin + l)-g{n) = [g-i(n)+(7-V^^J -Gin) 



(10) 



in terms of the vector ip with elements = (f>i{xn+i). To get exact learning curves, one would 
have to average this update formula over both the new training input Xn+i and all previous ones. 
This is difficult, but progress can be made by neglecting correlations of numerator and denominator 



in (10), averaging them separately instead. Also treating n as a continuous variable, this yields the 
approximation 

dGjn) (GHn)) 
dn CT2+trG(n) ^ ' 

where we have introduced the notation G (Q). If we also neglect fluctuations in Q, approximating 
(G^) = G^, this equation can easily be solved and yields G~^(n) = A^^ + a^'^n'l and so 

e^c{n) = tr (A^^ + a-^n'I)-'^ (12) 

with n' determined by the self-consistency equation 

n' + tr ln(I + (T^^n'A) — n. 

By comparison with (|^), n' can be thought of as an 'effective number of training examples'. The 
subscript UC in ( p^ ) stands for f/pper Continuous (i.e., treating n as continuous) approximation. 
A better approximation with a lower value is obtained by retaining fluctuations in Q. As in the 



case of the linear perceptron (SoUich, 1994), this can be achieved by introducing an auxiliary offset 



parameter v into the definition of 

g-^ =vI + A-^ +a-^'if^^. (13) 

One can then write 

- tr = l-tr (g) = de/dv (14) 



dv 

and obtains from ([III ) the partial differential equation 

de 1 de 



dn a'^ + e dv 



= (15) 



This can be solved for e{n,v) using the methods of characteristic curves (see App. ^). Resetting 
the auxiliary parameter v to zero yields the Lower Cbntinuous approximation to the learning curve, 
which is given by the self-consistency equation 

e^c{n)=tv (a-' + ^ 1) . (16) 

V (T^ + eLc / 
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It is easy to show that eLc < euc- One can also check that both approximations converge to 
the exact resuh in the large noise limit (as defined above). Encouragingly, we see that the LC 
approximation reflects our intuition about the difference between the initial and asymptotic regimes 
of the learning curve: For e ^ cr^, we can simplify ( [l6| ) to 



-1 

n 



eLc(ri)-tr A-^ + 1 

where as expected the noise level has dropped out. In the opposite limit e <C cr^, on the other 
hand, we have 

eLc(n)-tr + (17) 

which — again as expected — retains the noise level as an important parameter. Eq. ( [T^ ) also 
shows that elc approaches the OV lower bound (from above) for sufficiently large n. 

We conclude this section with a brief qualitative discussion of the expected n-dependence of elc 
(which below will turn out to be the more accurate of our two approximations). Obviously this 
n-dependence depends on the spectrum of eigenvalues A^; below we always assume that these are 
arranged in decreasing order. Consider then first the asymptotic regime e ^ cr^, where elc and eov 
become identical. One then shows easily that for eigenvalues decaying as a power-law, Xi ^ the 
asymptotic learning curve scales aspl £ lc ^ (n/cr^)"^''"^^/''; this is in agreement with known exact 
results (Silverman, 1985; Ritter, 1996| ). In the initial regime e ^ cr^, on the other hand, one can 



take (7^ ^ and finds then a faster decay^ of the generalization error, clc ^ n~^^^^\ We are not 
aware of exact results pertaining to this regime, except for the OU case in c? = 1 which has r = 2 
and for which thus eLc ~ in agreement with an exact calculation (Manfred Opper, private 
communication) . 



3 Comparisons with bounds and numerical simulations 

We now compare the LC and UC approximations with existing bounds, and with the 'true' learning 
curves as obtained by numerical simulations. A lower bound on the generalization error was given 



by Michelli and Wahba (Michelh and Wahba, 1981) as 



e(n) > CMwin) = J2 (1^) 

i—n-\-l 

This bound is derived for the noiseless case by allowing 'generalized observations' (projections of 
9*{x) along the first n eigenfunctions of C{x, x')), and so is unlikely to be tight for the case of 'real' 
observations at discrete input points. Given that the bound is derived from the (t^ — > limit, it can 
only be useful in the initial (small n, e ^ cr^) regime of the learning curve. There, it confirms the 
conclusion of our intuitive discussion above that the learning curve has a lower limit below which 
it will not drop even for —^ 0. 

Plaskota ( Plaskota, 199C| ) generalized the MW approach to the noisy case and obtained the 



following improved lower bound: 



e(n)>epi(n)=min^^-^+ ^ A, (19) 

where the minimum is over all non- negative 771,..., ?7„ obeying Yl^=i''li ^ S = YM=i^i^i'^^i)- 
Plaskota only derived this bound for covariance functions for which the prior variance C{x,x) 

^ Among the scenarios studied in the next section, the OU covariance function provides a concrete example of this 
kind of behaviour: In d = 1 dimension it has, from (k:4|), ^ and thus r = 2. The RBF covariance function, on 
the other hand, has eigenvalues decaying faster than any power law, corresponding to r — ► 00 and thus elc f^/n 
(up to logarithmic corrections) in the asymptotic regime. 

® For covariance functions with eigenvalues decaying faster than a power law, the behaviour in the initial regime 
is nontrivial; for the RBF covariance function in d = 1 with uniform inputs, for example, we find (for large n) 
~ ncxp{—cn^) with some constant c. 
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is independent of x. We call these 'uniform' covariance functions; due to the general identity 
{C{x,x)) = trA, they obey C{x^x) = trA (and hence S = ntrA). We recap Plaskota's proof 
in App. y| and also show there that it extends to general covariance functions in the form stated 
above. The Plaskota bound is close to the MW bound in the small n regime (equivalent to cr^ ^ 0); 
for larger n it becomes substantially larger. It therefore has the potential to be useful for both 
small and large n. Note that, in contrast to all other bounds discussed in this paper, the MW and 
Plaskota bounds are in fact 'single data set' (worst case) bounds: They apply to e(-D) for any data 
set D of the given size n, rather than just to the average]^ e(n) = {e{D)). 



Opper used information theoretic methods to obtain a different lower bound (Opper, 1997), but 
we will not consider this because the more recent OV bound (^ is always tighter. Note that the 
OV bound incorrectly suggest that e decreases to zero for ct^ ^ at fixed n. It therefore becomes 
void for small n (where e ^ cr^) and is expected to be of use only in the asymptotic regime of large 
n. 



There is also an Upper bound due to Opper ( pppcr, 1997 ) 



e(n) < evo{n) = {(J^^n)-'^ tr ln(I + <j-^nA) + tr (A^^ + cr^^^i)-! (20) 

Here e is a modified version of e which (in the rescaled version that we are using) becomes identical 
to e in the limit of small generalization errors (e ^ ct^), but never gets larger that 2a^; for small 
n in particular, e(n) can therefore actually be much larger than e(n) and its bound (pO|). For this 
reason, and because in our simulations we never get very far into the asymptotic regime e ^ ct^, 
we do not display the UO bound in the graphs below. 



The UO bound is complemented by an upper bound due to MUiams and Vivarelli (Williams 
and Vivarelli, 2000), which never decays below values around and is therefore mainly useful 



in the initial regime e 3> cr^. It applies for one-dimensional inputs x and stationary covariance 
bmctions — for which C{x, x') = Cs{x — x') is a function of x — x' alone — and reads: 

1 f°° 

e{n) < ewv(n) = ^(0) - ^^^^^ ^ ^2 fn{a) Cl{a) (21) 

with 

/„(a) = 2(1 - a)"e(l - a) + 2(n - 1)(1 - 2a)"e(l - 2a) (22) 

and where the Heaviside step functions 8 (defined as 8(2:) = 1 for 2: > and = otherwise) in the 
two terms imply that only values of a up to 1 and 1/2, respectively, contribute to the integral in (pl|). 
The function /„(a) is a normalized distribution over a which for n — > cx) becomes peaked around 
a = 0, implying that the asymptotic value of the bound is ewv('T- 00) = Cs(0)cr^/[Cs(0) + cr^] ~ 
for ^ Cs{0). The derivation of the bound is based on the insight that e{x, D) always decreases 
as more examples are added; it can therefore be upper bounded for any given x by the smallest 
e{x^ D') that would result from training on any data set D' comprising only a single example from 
the original training set D. The idea can be generalized to using the smallest e{x,D') obtainable 



from any two of the training examples, but this does not significantly improve the bound (Williams 



and Vivarclh, 200C ) 



As stated above in (gl|,^2|), the WV bound applies only to the case of a uniform input distribu- 
tions over the unit interval [0, 1]. However, it is relatively straightforward to extend the approach to 
general (one-dimensional) input distributions P{x); only the data set average becomes technically 
a little more complicated. We omit the details and only quote the result: Eq. ( ^l| ) remains valid if 
the expression ( |2^ ) for /„(a) is generalized to 

Ua) = n{P{x^a)[l-Q{x)r-'+P{x + a)[Q{x)r-')^ 

+ n{n - 1) (8(x' - x - 2a)[P{x + a) + P{x' - a)][l + Q{x) - Q(a;')]""^)^ (23) 

where Q{z) — J^^dx P{x) is the cumulative distribution function. In the simpler scenario consid- 
ered by Williams and Vivarelli this can be shown to reduce to (p2[), while in the most general case 
the numerical evaluation of the bound requires a triple integral (over x, x' and a). 

'^Our generalized version of the Plaskota bound depends on the specific data set only through the value of 
S = C{xi,xi). To obtain an average case bound one would need to average over the distribution of S. 
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Finally, there is one more upper bound, due to Trecate, M^iUiams and Opper ( Trecate et al 



1999|); based on the generalization error achieved by a 'suboptimal' Gaussian regressor, they showed 



e(n) < exwo M = tr A — rt — 

^ Ci 

I 

where 

c, = {n - 1)A, + a^ + {C{x, x)4>^{x))^ . (24) 

For a uniform covariance fimction, the average in Ci becomes tr A (^(j)f{x)'j^ = tr A and the bound 
simplifies to 



trA + cr2 + (n- f)A, 



We now compare the quality of these bounds and our approximations with numerical simulations 
of learning curves. All the theoretical expressions require knowledge of the eigenvalue spectrum of 
the covariance function, so we focus on situations where this is known analytically. We consider 
three scenarios: For the first two, we assume that inputs x are drawn from the d-dimensional unit 
hypercube [0, 1]'', and that the input density is uniform. As covariance functions we use the RBF 
function C{x,x') = exp[— ||x — a;'|p/(2Z2)] and the OU function C{x,x') = cxp(— ||a; — x'\\/l), to 
have the extreme cases of smooth and rough functions to be learned; both contain a tunable length 
scale I. To be precise, we use slightly modified versions of the RBF and OU covariance functions 
(using what physicists call 'periodic boundary conditions'Vwhich make the eigenvalue calculations 
analytically tractable; the details are explained in App. |C|. In the third scenario we explore the 
effect of a non-uniform input distribution by considering inputs x drawn from a c?-dimensional (zero 
mean) isotropic Gaussian distribution P{x) cx exp[— ||a;|p/(2(7^)], for an RBF covariance function. 
Details of the eigenvalue spectrum for this case can also be found in App. |c[ Note that in all 
three cases, the covariance function is uniform, i.e. has a constant variance C'{x,x); we have fixed 
this to unity without loss of generality. This leaves three variable parameters: the input space 
dimension d, the noise level and the length scale I. As explained above we generically expect 
the prior variance to be significantly larger than the noise on the training data, so we only consider 
values of (7^ < 1. The length scale I should also obey ^ < 1; otherwise the covariance functions 
C{x,x') would be almost constant across the input space, corresponding to a trivial GP prior of 
essentially ^-independent functions. We in fact choose the length scale I for each d in such a way 
as to get a reasonable decay of the learning curve within the range of n = . . . 300 that can be 
conveniently simulated numerically. To see why this is necessary, note that each covariance 'dent' 
covers a fraction of order l'^ of the input space, so that the number of examples n needed to see a 
significant reduction in generalization error e will scale as (l/l)'^. This quickly becomes very large 
as d increases unless I is increased simultaneously. (The effect of larger I leading to a faster decay 
of the learning curve was also observed in ( Williams and Vivarelli, 2000| ).) 



In the following figures, we show the lower bounds (Plaskota, OV), the non-asymptotic upper 
bounds (TWO and, for d = 1 with uniform input distribution, WV), and our approximations 
(LC and UC). The true learning curve as obtained from numerical simulations is also shown. For 
the numerical simulations, we built up training sets by randomly drawing training inputs from 
the specified input distribution. For each new training input, the matrix inverse has to be 
recalculated. By partitioning the matrix into its elements correspo nding to the old an d new inputs. 



this inversion can be performed with ©(n^) operations (see e.g. ( Press et al., 1992 )), as opposed 
to C(n^) if the inverse is calculated from scratch every time. With known, the generalization 
error e{D) was then calculated from (H), with the average over x estimated by an average over 
randomly sampled test inputs. This process was repeated up to our chosen rtmax = 300; the results 
for e(D) were then averaged over a number of training set realizations to obtain the learning curve 
e(n). In all the graphs shown, the size of the error bars on the simulated learning curve is of the 
order of the visible fluctuations in the curve. 

In Figure ^ we show the results for an OU covariance fimction with inputs from [0, 1]*^, for d = 1 
(left) and d = 2 (right). One observes that the lower bounds (Plaskota and OV) are rather loose in 
both cases. The TWO upper bound is also far from tight; the WV upper bound is better where it 
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Figure 4: Learning curve for a GP with RBF covariance function and inputs uniformly drawn from 
X € [0, 1]'', at noise level = 0.05; dimension d = 1, length scale I — 0.01. We also show the MW 
bound here to show how it is first close to the Plaskota bound but then "misses" the change in 
behaviour where the generalization error crosses over into the asymptotic regime (e ^ cr^). 



can be defined (for d = 1). Our approximations, LC and UC, are closer to the true learning curve 
than any of the bounds and in fact appear to bracket it. 

Similar comments apply to Figure ^, which displays the results for an RBF covariance function 
with inputs from [0, 1]. Because functions from an RBF prior are much smoother than those from 
an OU prior, they are easier to learn and the generalization error e shows a more rapid decrease 
with n. This makes visible, within the range of n shown, the anticipated change in behaviour as e 
crosses over from the initial (e S> cr^) to the asymptotic (e <C cr'^) regime. The LC approximation 
and, to a less quantitative extent, the Plaskota bound both capture this change. By contrast, the 
OV bound (as expected from its general properties discussed above) only shows the right qualitative 
behaviour in the asymptotic regime. 

Figure ^ shows corresponding results in higher dimension (d = 4), at two different noise levels 
a^. One observes in particular that the OV lower bound becomes looser as decreases; this is as 
expected since for cr^ ^ the bound actually becomes void (eov 0). The Plaskota bound also 
appears to get looser for lower cr^ , though not as dramatically. (Note that the kinks in the Plaskota 
curve are not an artifact: For larger d the multiplicities of the different eigenvalues can be quite 
large; the value of epi can become dominated by one such block of degenerate eigenvalues, and kinks 
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Figure 5: Learning curve for a GP with RBF covariance function and inputs uniformly drawn from 
X G [0, l]'^, for dimension d = A and length scale I = 0.3. Left: noise level — 0.05. Right: 
(7^ — 0.001. Note that, for lower cr^, the OV bound becomes looser as expected (it approaches zero 
for cr2 ^ 0). 



occur where the dominant block changes.) The TWO upper bound, finally, is only weakly affected 
by the value of and quite loose throughout. 

All results shown so far pertain to uniform input distributions (over [0, 1]''). We now move to the 
last of our three scenarios, a GP with an RBF covariance function and inputs drawn from a Gaussian 
distribution (see App. ^ for details). In Figure || we see that in d = 1 the (generalized) WV bound 
is still reasonably tight, while the LC approximation now provides less of a good representation of 
the overall shape of the learning curve than for the case of uniform input distributions. However, 
as in all previous examples, the LC and UC approximations still bracket the true learning curve 
(and come closer to it than the bounds). One is thus lead to speculate whether the approximations 
we have derived are actually bounds. Figure |^ shows this not to be the case, however: In d — A, 
the true learning curve drops visibly below the LC approximation in the small n regime, and so the 
latter cannot be a lower bound. The low noise case (cr^ ~ 0.001) shown here illustrates once more 
that the OV lower bound ceases to be useful for small noise levels. 

In summary, of the approximations which we have derived, the LC approximation performs 
best. While we know on theoretical grounds that it will be accurate for large noise levels ct^, the 
examples shown above demonstrate that it produces predictions close to the true learning curves 
even for the more realistic case of low noise levels (compared to the prior variance). As a general 
trend, agreement appears to be better for the case of uniform input distributions. 

It is interesting at this stage to make a connection to the recent work of Malzahn and Op- 
per (Malzahn and Opper, 2001). They devised an elegant way of approaching the learning curve 
problem from the point of view of statistical physics, calculating the relevant partition function 
(which is an average over data sets) using a so-called Gaussian variational approximation. The 
result they find for the Bayes error is identical to the LC approximation under the condition that 
e{x) = (e(x, D)) jj, the x-dependent generalization error averaged over all data sets, is independent 
of X. Otherwise, they find a result of the same functional form, e(n) = tr (A^^ + 7/I)~^, but the 
self-consistency equation for is more complicated than the simple relation rj = n/^e + a'^) obtained 
from the LC approximation (|l^). The LC approximation would thus be expected to perform less 
well for such "non-uniform" scenarios. This agrees qualitatively with our above findings: For the 
scenario with a Gaussian input distribution, the LC approximation is of poorer quality than for the 
cases with uniform input distributions^ over [0, 1]''. 



® It is easy to see that in these cases e(x) is indeed independent of x; the absence of effects from the boundaries 
of the hypercube comes from the periodic boundary conditions that we are using. 
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Figure 6: Learning curve for a GP with RBF covariance function (length scale I = 0.01) and inputs 
drawn from a Gaussian distribution in d — 1 dimension; noise level = 0.05. Note that the LC 
approximation provides less of a good representation of the overall shape of the learning curve here 
than for the previous examples with uniform input distributions. The curve labelled WV shows our 
generalized version of the Wilhams-Vivarelli bound (see eq. (|23|)). 




Figure 7: Learning curve for a GP with RBF covariance function (length scale I — 0.3) and inputs 
drawn from a Gaussian distribution in d = 4 dimensions. Left: noise level = 0.05. Right: 
— 0.001. The OV lower bound is very loose for this smaller noise level. Note that, in contrast 
to all previous examples, there is a range of n here where the true learning curve lies below the LC 
approximation. 
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4 Improving the approximations 



In the previous section we saw that in our test scenarios the LC approximation ( [l6|) generally pro- 
vides the closest theoretical approximation to the true learning curves. This may appear somewhat 



surprising, given that we made two rather drastic approximations in deriving (16): We treated 
the number of training examples n as a continuous variable, and we decoupled the average of the 
right hand side of (|l^) into separate averages over numerator and denominator. We now investi- 
gate whether the LC prediction for the learning curves can be further improved by removing these 
approximations. 

We begin with the effect of n, the number of training examples, taking only discrete (rather 
than continuous) values. Starting from (p^, averaging numerator and denominator separately as 
before and introducing the auxiliary variable v as in ( p^ , p^ , we obtain 

ein+l)-e{n) = ^^^ (25) 
a'' + eov 

instead of (|l5|). It is possible to interpolate between the two equations by writing 

he{n + 5)~e{n)] = ^^^ (26) 
(T^ + e ov 

Then 5—1 corresponds to (Eq), which is the equation we wish to solve (discrete n), while in the 



limit (5 — > we retrieve (15). To proceed, we treat 5 as a perturbation parameter and assume that 



the solution of (|2^) can be expanded as 

e = eo + 5ei + 0{S^) 
where eo = clc • Expanding both sides of ( p^ ) to first order in 5 yields 



dn dn 2 drfl \a'^ + [a'^ + eq)'^ J \ dv dv 



Comparing the coefficients of the 0{5^) terms gives us back (15) for eg, while from the 0{5) terms 
we get 

dei 1 dei _ I d'^ep ei dtp 

dn o"^ + tQ dv 2 dn"^ (cr^ + e^Y '^''^ 
This can again be solved using characteristics (see App. with the result 

ei = {a' + eo) ;)°^~°;] , a, = (a^ + eo)-'^ tr f + ^^l) (28) 
(1-7102)^ V cr +eo y 



Setting 6 back to 1 to have the case of discrete n in (^6|), we then have 

clci — ^0 + ^1 = clc + Cl 

as the improved LC approximation that takes into account the effects of discrete n (up to linear 
order in a perturbation expansion in 6). We see that the correction term (^ ) is zero at n = as it 
must since elc gives the exact result e = tr A there. It can also be shown that ei < for all nonzero 
n. This can be understood as follows: The decrease of e(n) becomes smaller (in absolute value) 
as n increases. Comparing (^|) and (^), we see that the continuous n-approximation effectively 
averages the decrease term over the range n . . .n+1 rather than evaluating it at n itself; it therefore 
produces a smaller decrease in e{n). The true decrease for discrete n is larger, and so one expects 
the correction ei to be negative, in agreement with our calculation. 

In Figure || we show eLc and ei for one of the scenarios considered earlier; the results are typical 
also of what we find for other cases. The most striking observation is the smallness of ei: Its absolute 
value is of the order of 1% of clc or less, and consequently elc and elci are indistinguishable on the 
scale of the plot. On the one hand, this is encouraging: Given that ei is already so small, one would 
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Figure 8: Solid line: LC approximation elc for the learning curve of a GP with RBF covariance 
function and Gaussian inputs. Parameters are as in Figure ^(left): Dimension d = 4, length scale 
I = 0.3, noise level cr^ = 0.05. Dashed line: First order correction ei arising from the discrete nature 
of the number of training examples n. Note that ei has been multiplied by -100 to make it positive 
and visible on the scale of the values of elc- 



expect higher orders in a perturbation expansion in 6 to yield even smaller corrections. Thus eLCi 
is likely to be very close to the result that one would find if the discrete nature of n was taken into 
account exactly. On the other hand, we also conclude that treating n as discrete is not sufficient 
to turn the LC approximation into a lower bound on the learning curve; in Fig ^, for example, the 
curve for eLCi would lie essentially on top of the one for eLc a-nd so still be significantly above the 
true learning curve for small n. 

It is clear at this stage that in order to improve the LC approximation significantly one would 
have to address the decoupling of the numerator and denominator averages in (^0|) . Generally, if a 
and b are random variables, one can evaluate the average of their ratio perturbatively as 



Aa 



(b) + Ab 



(a) 
(b) 



{{Abf) 



{by 



(AaAb) 



up to second order in the fluctuations. (This idea was used in (SoUich, 1994) to calculate finite 
A'^-corrections to the N oo limit of a linear learning problem.) To apply this to the average of 
the right hand side of (^0|) over the new training input Xn+i and all previous ones, one would set 

a = Q{n)xjjxjj^Q{n), b = a"^ + Q {71)11) 

One then sees that averages such as {ab), required in {AaAb) = (ab) — (a) (6), involve fourth- 
order averages {(j)i{x)4>j{x)(j)k{x)(j)i{x))^ of the components of V- In contrast to the second-order 
averages {(j)i{x)(j)j{x)) = 6ij, such fourth-order statistics of the eigenfunctions do not have a simple, 
covariance function-independent form. Even if they were known, however, one would end up with 
averages over the entries of the matrix Q which cannot be reduced to e = tr (Q) (e.g. by derivatives 
with respect to auxiliary parameters). Separate equations for the change of these averages with n 
would then be required, generating new averages and eventually an infinite hierarchy which cannot 
be closed. We thus conclude that a perturbative approach is of little use in improving the LC 
approximation beyond the decoupling of averages. The approach of ( Malzahn and Opper, 200l] ) 
thus looks more hopeful as far as the derivation of systematic corrections to the approximation is 
concerned. 
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5 How good can bounds and approximations be? 



In this final section we ask whether there are hmits of principle on the quality of theoretical predic- 
tions (either bounds or approximations) for GP learning curves. Of course this question is mean- 
ingless unless we specify what information the theoretical curves are allowed to exploit. Guided by 
the insight that all predictions discussed above depend (at least for uniform covariance functions) 
on the eigenvalues of the covariance function only (and of course the noise level cr^), we ask: How 
tight can bounds and approximations be if they use only this eigenvalue spectrum as input? 

To answer this question, it is useful to have a simple scenario with an arbitrary eigenvalue 
spectrum for which learning curves can be calculated exactly. Consider the case where the input 
space consists of N discrete points x"; the input distribution is arbitrary, P(x") = pa with J2a ~ 
1. Take the covariance function to be degenerate in the sense that there are no correlations between 
different points: C'{x°',x^) = CaSap- The eigenvalue equation (^ then becomes simply 

Y,C{x'',x^)cl){x0)pf, = c^PaHx") = A0(x") 

so that the N different eigenvalues are Aq — CaPa- The eigenfunctions are 4>a{x^) — Pa^^'^^ap, 
where the prefactor follows from the normalization condition 'Yli-yPi4'a{x^)4'p{^"') — ^afS- Note 
that by choosing the Ca and pa appropriately, the Xa can be adjusted to any desired value in this 
setup. The same still holds even if we require the covariance function to be uniform, i.e., Cq to be 
independent of a. 

A set of n training inputs is, in this scenario, fully characterized by how often it contains each 
of the possible inputs x" ; we call these numbers . The generalization error is easy to work out 
from (H), using 

{^^^)aP = ^n^(l)a{x'^)<l>p{x^) = {na/Pa)Sal3- 

7 

This shows that '^^'^ is a diagonal matrix, and thus from (||) 

e{D) = tr (A-i + ^-^M/T*)-! ^ ^ (A"! + a-^/p^y' ^ 2 ^ ''\ / (29) 

This has the expected form: The contribution of each eigenvalue is reduced according to the ratio 
of the noise level ct^ and the signal riaXa/Pa = n-aCa received at the corresponding input point. To 
average this over all training sets of size n, one notices that Ua has a binomial distribution, so that 

" / \ 2 
e(n)=^A„ E in 2^ ''x / 

n \ J O- + UaXa Pa 

Writing (cr^ + n^Xa/Pa)^^ = Io'^''~ r"°^°^^°'^ , we can perform the sum over and obtain 

e{n) ^ j^drY^Xa (l - Pa + Par^°/^°"')" (30) 



Q 



as the final result; the integral over r can easily be performed numerically for given set of eigenvalues 
Xa and input probabilities Pa- Note that, having done the calculation for a finite number TV 
of discrete inputs points (and therefore of eigenvalues), we can now also take N to infinity and 
therefore analyse scenarios (such as the ones studied in Sec. H) with an infinite number of nonzero 
eigenvalues. 

A simple limiting case will now tell us about the quality of eigenvalue-dependent upper bounds 
on learning curves. Assume that one of the pa is close to 1, whereas all the other ones are close 
to 0. From (|30|), one then sees that only the contribution from the eigenvalue Xa with w 1 is 
reduced as n increase^ while all other ones remain unaffected, so that 

e{n)^tvK-Xa+ .^""^^ =trA- ^ > tr A - A^ (31) 
+ nXa o"' + nXa 



^This holds if n is not too large, more precisely if npfj <C 1 for all the 'small' {/3 ^ a). 
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Figure 9: Learning curve for a GP with RBF covariance function and inputs uniformly drawn from 
X e [0, 1]''. Parameters are as in Figure ||(left): Dimension d = 4, length scale I = 0.3 and noise 
level (T^ = 0.05. The curves Sim (true learning curve, from numerical simulations), UC, LC and 
TWO are also as in Figure |(left). The curve labelled "Deg" shows the exact learning curve for the 
degenerate scenario (outputs for different inputs are uncorrelated) with exactly the same spectrum 
of eigenvalues Xi of the covariance function (and uniform prior variance C{x,x)). The curves Sim 
and Deg differ significantly, showing that learning curves cannot be predicted reliably based on 
eigenvalues alone. 



If Aq, <C tr A, then we can make the reduction in generalization error arbitrarily small. It thus 
follows that there is no non-trivial upper bound on learning curves that takes only the eigenvalue 
spectrum of the covariance function as input. [Accordingly, the two non-asymptotic upper bounds 
(WV and TWO) discussed in Sec. || both contain other information, via the weighted averages 
of C^ix) = C^(0,a;) in (|2l] ) and the average of C{x,x)(j)'^{x) in (|24|).] In particular, this implies 
that our UC approximation cannot be an upper bound (even though the results for all scenarios 
investigated above suggested that it might be). Furthermore, our result shows that lower bounds 
on the generalization error {e.g. the OV or Plaskota bounds) can be arbitrarily loose. A similar 
observat ion holds for upper bounds on the (data set-averaged) training error et : Opper and Vivarelli 



showed ( Opper and Vivarelli, 199£ ) that their eov actually also provides an upper bound for this 
quantity (so that the two errors 'sandwich' the bound, Ct ^ eov ^ e)- In the present case it is easy 
to calculate et explicitly; we omit details and just quote the result 

et ~ r- < K 

(T^ + nXa 

By taking Aq, <C tr A, one then sees that the upper bound eov can be made arbitrarily loose for 
any fixed n (and that the ratio of training and generalization error can be made arbitrarily small). 

One may object that the above limit of most of the Pa tending to zero is unrealistic because it 
implies that the corresponding prior variances Cq. = \a/pa would become very large. Let us therefore 
now restrict the prior variance to be uniform, Cq, = c. It then follows that Aq = c/pa and hence 
Pa = Ag/tr A. With this assumption, only the Aq, and remain as parameters affecting the learning 
curve (^o|). The results for an eigenvalue spectrum from one of the situations covered in Sec. [sjare 
shown in Figure ^ The main conclusion to be drawn is that the learning curves for the present 
scenario are quite different from the ones we found earlier, even though the eigenvalue spectra and 
noise levels are, by construction, precisely identical. This demonstrates that theoretical predictions 
for learning curves which only take into account the eigenvalue spectrum of a covariance function 
cannot universally match the true learning curves with a high degree of accuracy; the quality of 
approximation will vary depending on details of the covariance function and input distribution that 
are not encoded in the spectrum. Note that Figure |^ also provides a concrete example for the fact 
that the UC approximation is not in general an upper bound on the true learning curve; in fact, it 



16 



10' 



10" 




O^unif a"=0.05 
^^unif ct'=0.001 
Gauss a =0.05 
^ Gauss a'=0.001 



100 



200 



300 



Figure 10: Comparison of the Plaskota bound (solid lines) and the lowest generalization error 
achievable for single data sets of size n within the degenerate scenario (dashed lines). The eigenvalue 
spectra used to construct the curves are those for an RBF covariance function with length scale 
I = 0.3, in c? = 4 dimensions, and for the input distributions (uniform over [0, 1]'', or Gaussian) 
shown in the legend; the noise level is also given there. Note that for a given n, the curves become 
closer for lower a^; this is as expected since for cr^ ^ the Plaskota bound can be saturated for a 
specific data set (see text). 



here underestimates the true e{n) quite significantly. 

We can also use the present scenario to assess whether, as a bound on the generalization error 
resulting from a single training set, the Plaskota bound (19) could be significantly improved. We 
focus on the case of uniform covariance functions, where (29) becomes 

a 

For any assignment of the Ua there is at least one training set of size n ~ Ua for which the 
generalization error is given by (^2[ ). Minimizing numerically over the tIq, for each given n, we 
find the curves shown in Figure |lO|, where the Plaskota bounds for the same eigenvalue spectra are 
also shown. The curves are quite close to each other, implying that the Plaskota bound cannot 
be significantly tightened as a single data set bound (assuming, as throughout, that the improved 
bound would again only be based on the covariance function's eigenvalue spectrum). In the limit 
cr^ — > 0, the bound — which then reduces to the MW bound ( |T8| ) — cannot be tightened at all, as 
setting TT-Q = 1 for a = 1 . . . n and tIq, = for a > n + 1 in ( |32| ) shows. 

Within the simple degenerate scenario introduced in this section, we finally comment briefly on 



a universal relation recently proposed by Malzahn and Opper (Malzahn and Oppcr, 2001). They 
suggest considering an empirical estimate of the (Bayes) generalization error, which is obtained by 
replacing the average over all inputs x by one over the n training inputs Xi : 



1 " 

ecmp{D) = - ^ e{xi,D) 



n 

1=1 

Within the approximations of their calculation, the data set average of this quantity is then uni- 
versally linked to a modified version of the true generalization error: 

(ee.p(D)),^(^4^)^ (33) 

Note that the average over data sets is on the 'inside' of the fraction on the right hand side, through 
the definition of €{x) — {e{x, D)) j^. Within our degenerate scenario, we can calculate both sides 
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of (|33| ) explicitly, but find no obvious relation between the two sides. However, if we move the data 
set average on the right hand side to the outside, we do (after a brief calculation, the details of 
which we omit) find a simple result: 

As indicated by the subscripts, the left hand side of this relation is to be evaluated for data sets of 
size n + 1 rather than n. The result ( p^ is remarkable in that it holds for any eigenvalue spectrum 
and any input distribution (within the degenerate scenario considered here). We take this as a 
hopeful sign that some universal link between true and empirical generalization errors, along the 



lines derived by (Malzahn and Opper, 2001) within their approximation, may indeed exist. 



6 Conclusion 

In summary, we have derived an exact representation of the average generalization e error of Gaus- 
sian processes used for regression, in terms of the eigenvalue decomposition of the covariance func- 
tion. Starting from this, we obtained two different approximations (LC and UC) to the learning 
curve e(n). Both become exact in the large noise limit; in practice, one generically expects the 
opposite case {a^/C{x,x) <C 1), but comparison with simulation results shows that even in this 
regime the new approximations perform well. 

The LC approximation in particular represents the overall shape of the learning curves very 
well, both for 'rough' (OU) and 'smooth' (RBF) Gaussian priors, and for small as well as for large 
numbers of training examples n. It is not perfect, but does generally get substantially closer to 
the true learning curves than existing bounds (two of which, due to Plaskota and to Williams and 
Vivarelli, we generalized to a wider range of scenarios). For situations with non-uniform input 
distributions the predictions of the LC approximation tend to be less accurate, and we linked this 



observation to recent work by Malzahn and Opper (Malzahn and Opper, 2001 ) on the effects of non- 
uniformity across input space. Their result, which reduces to the LC approximation for sufficiently 
uniform scenarios, may in other cases provide better approximations, but this has to be traded off 
against the higher computational cost that would be involved in actually evaluating the predictions. 

We then discussed how the LC approximation could be improved. The effects of discrete n can 
be incorporated to leading order, but were seen to be relatively minor; on the other hand, the second 
approximation involved in the derivation (decoupling of averages) appears difficult to improve on 
within our framework. 

Finally, we investigated a simple "degenerate" Gaussian process learning scenario, where the 
outputs corresponding to different inputs are uncorrelated. This provided us with a means of 
assessing whether there are limits on the quality of approximations and bounds that take into 
account only the eigenvalue spectrum of the covariance function. We found indeed that such 
limits exist: There can be no nontrivial upper bound on the learning curve of this form, and 
approximations are necessarily of limited quality because different covariance functions with the 
same eigenvalue spectrum can produce rather different learning curves. We also found that as a 
bound on the generalization error for single data sets (rather than its average over data sets) the 
Plaskota bound is close to being tight. Whether a tight lower bound on the average learning curve 
exists remains an open question; one plausible candidate worth investigating would be the average 
generalization error of our degenerate scenario, minimized over all possible input distributions for 
a fixed eigenvalue spectrum. 

There are a number of open problems. One is whether a sub-class of CP learning scenarios 
can be defined for which the covariance function's eigenvalue spectrum is sufficient to predict the 
learning curves accurately. Alternatively, one could ask what (minimal) extra information beyond 
the eigenvalue spectrum needs to be taken into account to arrive at accurate learning curves for all 
possible CP regression problems. Finally, one may wonder whether the eigenvalue decomposition 
we have chosen, which explicitly depends on the input distribution, is really the optimal one. On 



the one hand, recent work (see e.g. ( Williams and Seeger, 2000 )) appears to answer this question 
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in the afRrmative. On the other hand, the variabihty of learning curves among GP covariance 
functions with the same eigenvalue spectrum suggests that the eigenvalues alone do not provide 
sufficient information for accurate predictions. One may therefore speculate that eigendecomposi- 
tions with respect to other input distributions {e.g., maximum entropy ones) might not suffer from 
this problem. We leave these challenges for future work. 
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A Solving for the LC approximation 



In this appendix we describe how to solve eqns. (y_5|,^7|) for the LC approximation and its first order 
correction, using the method of characteristic curves. The method applies to partial differential 
equations of the form a df /dx + bdf /dy — c, where / — f{x, y) and a, 6, c can be arbitrary functions 
of X, y, f. Viewing the solution as a surface in x, y, /-space, one can then show ( [John, 197§| ) that 
if the point (a;o, J/Oi /o) belongs to the solution surface then so does the entire characteristic curve 
{x{t),y{t)J{t)) defined by 

^ = a, ^ = b, | = c, {x{0),y{Q)jm^{xo,yoJo) 

The solution surface can then be recovered by combining an appropriate one-dimensional family of 
characteristic curves. 

Denote the generalization error predicted by the LC approximation as eo{n,v), with v the 
auxiliary parameter introduced in (^,^^. It is the solution of the equation ( [T^ ) 

dep _ 1 dep ^ ^ 
dn (T^ + eo dv 

subject to the initial conditions ep{n = 0,v) — tr (A^^ + vl)^^ . These give us a family of solution 
points which the characteristic curves have to pass through, namely (n(0) = 0,v{0) = Vp,ep{0) — 
tr (A~^ + vpt)~'^). The equations for the characteristic curves are 

dn ^ dv 1 dep ^ 

dt ' dt cr^ -I- Co ' dt 
and can be integrated to give 

n = n{0) + t = t, v = v{0)-^ = vp - , eo = eo(0) = tr (A^^ + woI)"M35) 

Eliminating t (the curve parameter) and vp (which parameterizes the family of initial points) gives 
the required solution eo = tr {A^^ + [v + n/{a^ + eo)]I}~^. The LC approximation (|6|) is obtained 
by setting i; = 0. 

For the first order correction ei, we have to solve the equation (^) 

dei 1 dei 1 d'^ep ei dep 

dn (7^ + ep dv 2 dn^ {cr^ + ep)^ dv 

with the initial condition (explained in the main text) ei(n — 0,v) — 0. Hence a suitable family of 
initial solution points is (n(0) = 0, v{0) = vp, ei(0) = 0). The characteristic curves must obey 

dn dv 1 dei 1 d^ep ei dep 



dt ' dt <J^ + ep' dt 2 dn^ {a^ + eo)^ dv 

The solutions for the n{t) and w(i)-dependence are therefore the same as before, and as a result eo 
is again constant along the characteristic curves. For the derivatives of eo that appear, one finds 
after some algebra 

9f^Q , 2 , \2 ^2 d'^ep r,t 2 , ^ 0.2 - 

- ~(a +ep) , —-—^-2[a +ep)- 



dv 1 — na2 ' dn"^ (1 — na2)^ 
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Here we have used the definitions from (psj); because both 02 and 03 only depend on n and v in 
the combination v + n/ (cP' + eo), they are also constant along the characteristic curves. Using also 



that n ^ t from (35), the equation for ei becomes 



de-L 2 , ^ oj-as a2 



dt ' "'(l-ta2)3 "l-iaa 
This linear differential equation is easily integrated; using the initial condition ei(0) = one finds 

/ 2 , ^ ^ ^(Q2 - Q3) 

Eliminating t again via t = n finally gives the solution 



B The Plaskota bound and its extension 



To summarize the proof of Plaskota's lower bound on learning curves ( Plaskota, 1990 ) and generalize 
it to non-uniform covariance functions, we find it most convenient to work with a discrete space 
of inputs x", a = 1 . . . N, with input probabilities P(x") — pa- (The case where inputs can vary 
continuously can be approximated arbitrarily well by such a scenario, either by discretizing the 
input space on a sufficiently fine grid or by taking the x" as a large number TV of samples from 
the input distribution and setting pa = 1/^-) A function 9 on these inputs can then be thought 
of as a vector 6 with elements 9{x"); similarly, the covariance function C{x",x^) = Cap becomes 

an N X N matrix C = (^96^'^. The eigenvalue equation (||) and the corresponding normalization 
condition then read 

^Caf3Pp4'i{x'^) = Xi<j>i{x"), ^(l)i{x°')(l)j{x°')pa = % 

/3 a 

We can write these in a more compact form: If we set A — diag(Ai . . . Xn) and P = diag(pi . . -Pn), 
denote cf)^ the vector with entries (f)i{x°') and $ the N x N matrix whose columns are the cj)^, we 
have 

CP* = *A, *Tp* = I 

and hence 

C = *A*^ 

One can also write this in the form of a more conventional matrix diagonalization: 

pl/2cpl/2 ^ (pl/2$)A(pl/2$)T^ (pl/2$)T(pl/2^) ^ j 

It follows that the are the eigenvalues of the matrix pi/2CP^/^ and that 

(C(x, x))^ = tr PC = tr pi/2cpi/2 = ^ (3g) 

in agreement with (|^). 

Plaskota now considers the case of generalized training sets consisting of n generalized obser- 
vations yi = liJO + rji. Each yi is a linear combinations of the values of the function 9{x), with 
coefficients specified by the vector L/, corrupted by additive Gaussian noise rji which as before we 
take to be of variance a^. The conventional scenario, where each training point contains a (cor- 
rupted) observation of 9{x) at a single point x = x"^^\ corresponds to L; — e^j/). Here we denote 
Bq, the a-th unit vector (with the a-th component equal to one and all others zero) and write the 
l-th training input as x°'^^\ with a{l) £ {1 . . . N}. 

After the observation of the yi, we will have some posterior covariances (^A9{x°')A9{x^)') . Col- 
lecting these into an iV x iV matrix V one can write them explicitly as 

V = C - CL(L'^CL + aHy^I^C = (C^^ + cj-^LL^)-'^ (37) 
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where L is an x n matrix whose columns are the L; . The first version of tliis result can be seen as 
a generalization of (||); the second version is the corresponding generalization!^ of (||) in a different 
guise. It then follows that the generalization error is given by 

e{D) = ((A6l(x"))2) = trPV = trPC - tr [L^CPCL(M + ct^I)"^] (38) 

a 

where we have defined the n x n matrix M = L'^CL. Its trace (which will become important 
shortly) is tr M = LjCLj; for a conventional training set with training inputs a;"'^') this becomes 
trM = eI(,)Ce„(,) = C(x"W,^"(0). 

So far everything is exact. The idea behind Plaskota's bound is now as follows: For any given 
conventional training set with n training inputs, minimize e{D) over all generalized training sets of 
size n which have the same value of trM. The result then clearly gives a lower bound on e{D) for 
the given training set. 

To carry out this minimization, one can proceed as follows. The matrix M is symmetric and can 
be diagonalized, so that M = OrjO^ where O is an orthogonal n x n matrix and is a diagonal 
matrix whose entries rji are the eigenvalues of M. Since M is positive semi-definite, the rji are 
non-negativef^; we also assume them to be ordered in descending order, rji > . . . > rj^. It then 
follows that (M + cr^i)-! = 0(77 + a'^iy^O'^, so that 

e{D) = tr A - tr [0'^L'^CPCL0(r7 + cr^I)"^] 

where we have also used that trPC = trA from (pq). Now, if we define the N x n matrix 
Q = Ci/^LOry-i/^ and write C = C^/^PC^/^, then rp^Cf^CQr]^^^ = QTl'I'CPCLO and so 

n 

e{D) = tr A - tr [Q^CQ-ni^n + aH)-^] = tr A - \"{Q,^CQ)u—^ (39) 

2—1 

To minimize e{D), we need to make the sum over i maximal. If we write the n columns of Q 
as vectors Qi then (Q"^CQ)ii — Q^CQ^; also, from the definition of Q we have Q'^Q = I, 
implying that the Qj are orthonormal. Now it is easy to see that C = C^/^PC^/^ has the same 
eigenvalues as P^/^CP^/^, namely, the A^. Assuming these to be ordered in a descending sequence, 
the largest value that Q^CQ^ can take is therefore Ai. Because of the orthonormality of the Qi, 
one has similarly that J2i<k Qj^Qi ^ J2i<k ^'^^ k < n. Since the terms rji/ {rji + cr^) form 
a descending sequence (due to our assumed ordering of the 77^), it is then clear that the optimal 
situation is the one where Q^CQi, multiplying the largest value 771/(771 + cr'^), has the largest 
possible value Ai, QJCQ2 has the largest possible value subject to this, which is A2, and so on. 
Plaskota indeed proved formally that 

Finally, one can now optimize over the rji, subject to the constraint that we imposed initially, i.e., 
that trM — J^iVt equals the sum S — J2i C(a;"('\ x"('^). This gives the final Plaskota bound 

e{D) > tr A - max V = min V + "E (40) 

i—l / i>n+l 

as given in (^^. Plaskota also proved this bound to be realizable by an appropriate choice of the 
generalized observations L;. His original derivation only applied to uniform covariance functions, 
but the above proof sketch shows that this restriction is not in fact necessary. 

^"Note that the matrices V and Q are related in the same way as C and A; explicitly, one has V = ^Q^^ . 

We actually assume, for simplicity of presentation, that all the rji are nonzero. Equation (p9|) still holds if some 
Tji vanish, but the derivation is then slightly more complicated: One has to replace T7 by a srnaller diagonal matrix 
which contains only the positive eigenvalues and O by the matrix whose columns are the corresponding eigenvectors 
of M. 
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To evaluate the Plaskota bound in practice it is desirable to have a more explicit expression for 
the minimum over the r]i. Introducing a Lagrange multiplier a for the constraint X^i^j = "S*? one 
finds by differentiation of (40) that each positive rji must satisfy 



2,2 ' " = (41) 



while for the vanishing rji the quantity on the left hand side has to be positive, i.e., Xi < acP' . 
Due to the ordering of the \i , one sees from this that at the minimum the first k of the rji will be 
nonzero and the rest will be zero, with k < n a, number to be determined. Assume we know k. 
Then from (|l|) the nonzero r]i are given by 



1/2 



a' 



Using that J2i<k ^» ~ ^"^^ then finds a = (5* + ka'^)/{aJ2j<k '^j^^) ^^'^ hence 



^1/2 S + ka^ _ ^2 



K'',^—^--' (42) 



In order for the value of k that we assumed to be the correct one, this expression needs to be positive 
for i = 1 . . . /c, while for i = fc + 1 . . . n it must be zero or negative (as follows from < aa'^). Due 
to the ordering of the A^, it is sufficient to check whether (E|) gives % > and f^/c+i < 0. The 
k that satisfies these conditions^ gives the minimum in ([40[); using ( ^ ) , the bound can then be 
simplified to 

\i<k J i>k+l 

In the example scenarios for which we evaluated the bound, we found that fc = n in the vast majority 
of cases. A sensible numerical procedure is therefore to check first whether for k — n equation ( p^ ) 
gives rjn > 0. If yes, then k = n gives the required optimum; if no, then k should be decreased until 
the conditions ryfc > and rjk+i < are met. 



C Eigenvalue spectra for the example scenarios 

We explain first how the covariance functions with periodic boundary conditions for x [0, l]'^ are 
constructed. Consider first the case d — 1. The periodic RBF covariance function is defined as 

oo 

C{x,x') = J2 c{x-x' ~r) (43) 

r— — oo 

where c{x — x') = exp[— — x'l"^ /{2P)] is the original covariance function; for the periodic OU case 
we use instead c{x — x') ~ exp(— |x — x'\/l). One sees that for sufficiently small I [l <^ 1), only the 
r = term makes a significant contribution, except when x and x' are within « I of opposite ends 
of the input space (so that either x — x' + \ or x — x' — 1 are of order I). We therefore expect the 
periodic covariance functions and the conventional non-periodic ones to yield very similar learning 
curves, as long as the length scale of the covariance function is smaller than the size of the input 
domain. 

The advantage of having a periodic covariance function is that its eigenfunctions are simple 
Fourier waves and the eigenvalues can be calculated by Fourier transformation. This can be seen 

There is a unique k with this property; this follows because the minimum in ( ^o[ ) is over a convex function of 
the rji and therefore unique. 
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as follows. For the assumed uniform input distribution on [0,1], the defining equation for an 
eigenfunction (j){x) with eigenvalue A is, from 

{C{x,x)(f){x')) ^, — I dx C{x, x')(f>{x') — X<j){x). 



Inserting (43) and assuming that (j){x) is continued periodically outside the interval [0,1], this 
becomes 



/ dx' c{x-x' -r)(l3{x')= / dx' c{x-x')(i){x' -^r)^ / 



c{x — x')4>{x' r) ^ I dx' c{x — x')(j){x') — X(j){x) 

J —oo 

It is well known that the solutions of this eigenfunction equation are Fourier waves (t>q{x) — e^'^'^a; 
for integer (positive or negative) q, with corresponding eigenvalues 

/>oo 

A, = / dx C(x)e^27r«g:. 



The eigenvalues A, are real since c{x) = c{—x) (this follows from the requirement that the covariance 
function C{x^x') must be symmetric). The eigenfunctions for q ^ Q are complex in the form 
given but can be made explicitly real by linearly transforming the pair (f)q{x) and (f)-q{x) into 
(l/\/2)cos(27rga;) and (l/\/2) sin(27r(ja;). 

All of the above generalizes immediately to higher input dimension d. One defines 

C(x, x') = c{x — x' — r) 

r 

where r now runs over all d-dimcnsional vectors with integer components; the argument x — x' — r 
of c(-) is now a d-dimensional real vector. The eigenvalues of this periodic covariance function are 
then given by 

A, = \dx c(x)e-'"« " 



They are indexed by d-dimensional integer vectors q\ the integration is over all real-valued d- 
dimensional vectors x, and q ■ x va the conventional dot product between vectors. Explicitly, one 
derives that for the periodic RBF covariance function 

A, = (2^)''/2?''e-(2-0^lkllV2 
For the periodic OU covariance function, on the other hand, one has 

A, = K,/'^[l + (2^0'll'Zll']"^'+'^/' (44) 

with Kd = 2, 27r, Stt for d = 1,2,3, respectively; for general d, = 7r(''~i)/22'^r((d + l)/2) where 
Vi^z) = (z — 1)! is the Gamma function. 

All the bounds and approximations in principle require traces over the whole eigenvalue spec- 
trum, corresponding to sums over an infinite number of terms. Numerically, we perform the sums 
over all eigenvalues up to some suitably large maximal value Qmax of [[g[[. The remaining small 
eigenvalue tail of the spectrum is then treated by approximating \ \q\ \ as a continuous variable q and 
integrating over it from Qmax to infinity, with the appropriate weighting for the number of vectors q 
in a shell q< \ \q\ \ < q + dq. To check that this procedure gave accurate results, we always verified 
that the numerically calculated tr A agreed with the known value of C{x,x). 

The third scenario we consider is that of a conventional RBF kernel C{x,x') = exp[— [[a; — 
x'[[^/(2P)] with a nonuniform input distribution which we assume to be an isotropic zero mean 



Gaussian, P{x) oc exp[— [[a;[[^/(2tT^)]. The eigenfunctions and eigenvalues are worked out in (Zhu 



et al., 1998 ) for the case d — 1; the eigenvalues are labelled by a non-negative integer q and given 

by 

A, = (1 - 
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where 

p-^ = l + r/2 + ^/r^/4 + r, r = f/al. 

As expected, only the ratio r of the length scales I and ax enters since the overall scale of the inputs 
is immaterial. (To avoid this trivial invariance we fixed cr^ — 1/12; this specific value gives the 
same variance for each component of x as for the uniform distribution over [0, l]'^ used in the other 
scenarios.) For d > 1, this result generalizes immediately because both the covariancc function and 
the input distribution factorize over the different input components ( |Oppcr and VivarcUi, 1999 



Williams and Seeger, 2001). The eigenvalues are therefore just products of the eigenvalues for each 



component, and indexed by a d-dimensional vector q of non-negative integers: 

d 

\ = (45) 

1=1 

One sees that the eigenvalues will come in blocks: All vectors q with the same s — J^i 1i give 
the same Xq. Numerically, we therefore only have to store the different eigenvalues and their 
multiplicities, which can be shown to be|^ {d+s—l)\/[s\{d—l)\]. With this trick so many eigenvalues 
can be treated by direct summation that a separate treatment of the neglected eigenvalues (via an 
integral, as above) is unnecessary. 
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